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Abstract 

Most learning methods with rank or spar- 
sity constraints use convex relaxations, which 
lead to optimization with the nuclear norm 
or the €i-norm. However, several important 
learning applications cannot benefit from this 
approach as they feature these convex norms 
as constraints in addition to the non-convex 
rank and sparsity constraints. In this setting, 
we derive efficient sparse projections onto 
the simplex and its extension, and illustrate 
how to use them to solve high-dimensional 
learning problems in quantum tomography, 
sparse density estimation and portfolio selec- 
tion with non-convex constraints. 



1. Introduction 

We study the following sparse Euclidean projections: 

Problem 1. (Simplex) Given w g W, find a Eu- 
clidean projection of w onto the intersection of k- 
sparse vectors S fe = {(3 G R p : \{i : ft ^ 0}| < k} 
and the simplex A+ = {[3 e R p : ft > 0, J2i Pi = A } ; 

■p(w) e argmin ||/3 — w|| 2 . (1) 
/3:/3es fc nA+ 

Problem 2. (Hyperplane) Replace in (1) with the 
hyperplane constraint A\ = j/3 € M. p : ft = A}. 

We prove that it is possible to compute such projections 
in quasilinear time via simple greedy algorithms. 

Our motivation with these projectors is to address 
important learning applications where the standard 
sparsity/low-rank heuristics based on the l\ /nuclear- 
norm are either given as a constraint or conflicts 
with the problem constraints. For concreteness, we 
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highlight quantum tomography, density learning, and 
Markowitz portfolio design problems as running exam- 
ples. We then illustrate provable non-convex solutions 
to minimize quadratic loss functions 

f{f3):=\\y-A{(3)\\ 2 (2) 

subject to the constraints in Problem 1 and 2 with our 
projectors. In (2), we assume that y e W n is given 
and the (known) operator A, : M. p — > K m is linear. 

For simplicity of analysis, our minimization approach 
is based on the projected gradient descent algorithm: 

/3 l+1 =7>(/3 l -^V/(/3 1 )), (3) 

where f3 l is the i-th iterate, V/(-) is the gradient of 
the loss function, \i % is a step-size, and V(-) is based on 
Problem 1 or 2. When the linear map A. in (2) pro- 
vides bi-Lipschitz embedding for the constraint sets, 
we can derive rigorous approximation guarantees for 
the algorithm (3); cf., (Garg & Khandekar, 2009). 1 

To the best of our knowledge, explicitly sparse Eu- 
clidean projections onto the simplex and hyperplane 
constraints have not been considered before. The clos- 
est work to ours is the paper (Kyrillidis & Cevher, 
2012). In (Kyrillidis & Cevher, 2012), the authors pro- 
pose an alternating projection approach in regression 
where the true vector is already sparse and within a 
convex norm-ball constraint. In contrast, we consider 
the problem of projecting an arbitrary given vector 
onto convex-based and sparse constraints jointly. 

At the time of this submission, we become aware of 
(Pilanci et al., 2012), which considers cardinality reg- 
ularized loss function minimization subject to simplex 
constraints. Their convexified approach relies on solv- 
ing a lower-bound to the objective function and has 
0(p 4 ) complexity, which is not scalable. We also note 
that regularizing with the cardinality constraints is 
generally easier: e.g., our projectors become simpler. 

1 Surprisingly, a recent analysis of this algorithm along 
with similar assumptions indicates that rigorous guaran- 
tees can be obtained for minimization of general loss func- 
tions other than the quadratic (Bahmani et al., 2011). 
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Notation: Plain and boldface lowercase letters rep- 
resent scalars and vectors, resp. The i-th entry of a 
vector w is u>;, and [m;] + = max(u>;, 0), while f3 t is the 
model estimate at the z-th iteration of an algorithm. 
Given a set S C J\f = {1, . . . ,pj, the complement S c 
is defined with respect to Af, and the cardinality is 
|<S|. The support set of w is supp(w) = {i : Wi ^ 0}. 
Given a vector w G R p , is the projection (in M p ) 
of w onto S, i.e. (ws) Sc = 0, whereas W| s € R' 5 ' is 
w limited to S entries. The all-ones column vector is 

1, with dimensions apparent from the context. We de- 
fine Efc as the set of all fc-sparse subsets of AT, and we 
sometimes write j3 G £fc to mean supp(/3) € The 
trace of a matrix X is written tr(X). 

2. Preliminaries 

Basic definitions: Without loss of generality, as- 
sume w is sorted in descending order, so w\ is the 
largest element. We denote V\+ for the (convex) Eu- 
clidean projector onto the standard simplex At", and 
V\ for its extension to A\. The (non-convex) Eu- 
clidean projector onto the set is Ps k , which retains 
the fc-largest in magnitude elements. In contrast to 
V\, the projection V-% h need not be unique. 

Definition 2.1 (Operator pL k )- We define pL k (yv) as 
the operator that keeps the k-largest entries o/w (not 
in magnitude) and sets the rest to zero. This operation 
can be computed in 0(pmm(kAog(p)))-time. 

Definition 2.2 (Euclidean projection P\+). The pro- 
jector onto the simplex is given by 

(V\+ (w))i = [uii - t]+, where r := - [ ly* — A ) 

for p := maxjj : Wj > j(J2l=i w i ~ A )}- 
Definition 2.3 (Euclidean projection P\). The pro- 
jector onto the extended simplex is given by 



(V\(w))i = Wi 



whe 




- A 



Definition 2.4 (Restricted isometry property (RIP) 
(Candes et al., 2006)). A linear operator A. : 1Z P —> 
lZ m satisfies the k-RIP with constant 5k G (0, 1) if 

1 - 4 < \\A(f3)\\i/\\P\\ 2 2 < 1 + S k , V/3 g E fc . (4) 

Guarantees of the gradient scheme (3): Let 
y = A(/3*) + e G R m , (m <p), be a generative model 
where e is an additive perturbation term and /3* is 
the /c-sparse true model generating y. If the RIP as- 
sumption (4) is satisfied, then the projected gradient 



descent algorithm in (3) features the following invari- 
ant on the objective (Garg & Khandckar, 2009): 



/(/3 l+1 ) < 



2<) 



2 A- 



1 - hh 



/(/3 l ) + ci||e|| 2 , 



(5) 



for ci > and stepsize (/ = ■ Hence, for 

$2k < 1/3, the iterations of the algorithm are con- 
tractive and (3) obtains a good approximation on the 
loss function. In addition, (Foucart, 2010) shows that 
we can guarantee approximation on the true model via 



1/3 



i+l 



/3*||2<2(5 3fc ||/3 4 -/3* 



c 2 £ 2, 



(6) 



for c\ > and /j, 1 = 1. Similarly, when 63k < 1/2, 
the iterations of the algorithm are contractive. Differ- 
ent step size // strategies result in different guaran- 
tees; c.f., (Foucart, 2010; Garg & Khandekar, 2009; 
Kyrillidis & Cevher, 2011) for a more detailed dis- 
cussion. Note that to satisfy a given RIP constant 
5, random matrices with sub-Gaussian entries require 
m = O (k \og{p/k) /S 2 ) . In low rank matrix cases, sim- 
ilar RIP conditions for (3) can be derived with approx- 
imation guarantees; cf., (Meka et al., 2010). 

3. Underlying discrete problems 

Let [3* be a projection of w onto T,k l~l A^ or Efc n A\. 
We now make the following elementary observation: 

Remark 1. The Problem 1 and 2 statements can be 
equivalently transformed into the following nested min- 
imization problem: {<S*,/3*} = 



argmin 



argmin 

/3:/3 s GA+ or A, 
i3 s c=0 



||(/3-w)|J! + ||w| s J 



where supp(/3*) = S* and j3* G A^ or A 



a ■ 



Therefore, given S* = supp(/3*), we can find [3* 
by projecting W5* onto A^ or Aa within the k- 
dimensional space. Thus, the difficulty is finding S*. 
Hence, we split the problem into the task of finding the 
support and then finding the values on the support. 

3.1. Problem 1 

Given any support <S, the unique corresponding esti- 
mator is (3\ s = P\+(w\ s )- We conclude that (3* satis- 
fies (3* \s*y = an< l fi*\s* = ^ , a+( w |5*), and 

S* G argmin ||7 5 A +(w| 5 ) - W| s ||| + || W|^.^ ||| 

5:5eS fc 



argmax F + (S) 



(J) 



where F + (S) := H ~ ((^A+(w, s )) i - wtf). 
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Algorithm 1 GSSP 



1: S* = supp(7\ fc (w)) 

2: /3| 5 , =PA+(W| 5 .),/3| (5 , 



{Select support} 
= {Final projection} 



Algorithm 2 GSHP 

1: i = 1 , S = j, j € argmaxi [Att;*] {Initialize} 
2: Repeat: £ «- £ + 1, 5 «- 5 U j, where 



j € argmax i&A /\S 
3: Until £ = k, set 5* <- 5 
4: /3| S . = P A (W| S J, /3 |(st) 



{Grow} 

{Terminate} 
= {Final projection} 



This set function can be simplified to 



F+(S) 



ies 



K 2 



(8) 



where t (which depends on S) is as in Lemma 1. 

Lemma 1. Let (3 = F x+ (w) where = [tVi — r]+. 
Then, uii > r for all i <E S = supp(/3). Furthermore, 



h (E 



[51 



A). 



Proof. Directly from the definition of r in Definition 
2.2. The intuition is quite simple: the "threshold" r 
should be smaller than the smallest entry in the se- 
lected support, or we unnecessarily shrink the coef- 
ficients that are larger without introducing any new 
support to the solution. Same arguments apply to in- 
flating the coefficients to meet the simplex budget. □ 

3.2. Problem 2 

Similar to above, we conclude that 3* satisfies = 
T J x(w\ st ) and /3|/ s *n c = 0, where 

5* G argmin ||z — w|| 2 = argmaxF(S) (9) 

5:Ses fc SiSeSfc 

where z G M p with zi_ = ^(wu) and Z| 5c = and 



4. Sparse projections onto A^ and Aa 

Algorithm 1 below suggests an obvious greedy ap- 
proach for the projection onto n A^. We select the 
set S* by naively projecting w as P ifc (w). Remark- 
ably, this gives the correct support set for Problem 1, 
as we prove in Section 5.1. We call this algorithm 
the greedy selector and simplex projector (GSSP). The 
overall complexity of GSSP is dominated by the sort 
operation in p-dimensions. 

Unfortunately, the GSSP fails for Problem 2. As a 
result, we propose Algorithm 2 for the Efe n A\ case 



which is non-obvious. The algorithm first selects the 
index of the largest element that has the same sign 
as A. It then grows the index set one at a time by 
finding the farthest element from the current mean, as 
adjusted by lambda. Surprisingly, the algorithm finds 
the correct support set, as we prove in Section 5.2. We 
call this algorithm the greedy selector and hyperplanc 
projector (GSHP), whose overall complexity is similar 
to GSSP. 

5. Main results 

Remark 2. When the symbol S is used as S = 
supp(/3) for any (3, then if \S\ < k, we enlarge S until 
it has k elements by taking the first k — \S\ elements 
that are not already in S, and setting (3 = on these 
elements. The lexicographic approach is used to break 
ties when there are multiple solutions. 

5.1. Correctness of GSSP 

Theorem 1. Algorithm 1 exactly solves Problem 1. 

Proof. Intuitively, the fc-largest coordinates should be 
in the solution. To see this, suppose that u is the 
projection of w. Let Wi be one of the fc-(most positive) 
coordinates of w and = 0. Also, let w,j < Wi, i ^ 
j such that Uj > 0. We can then construct a new 
vector u' where u'j = Ui = and v! i = Uj. Therefore, 
u' satisfies the constraints, and it is closer to w, i.e., 

2uj(wi — Wj) > 0. Hence, u 



,/||2 



cannot be the projection. 

To be complete in the proof, we also need to show that 
the cardinality k solutions are as good as any other 
solution with cardinality less than k. Suppose there 
exists a solution u with support |«S| < k. Now add any 
elements to S to form S with size k. Then consider 
w restricted to S, and let u be its projection onto the 
simplex. Because this is a projection, ||u^ — W|^|| < 

u — w||. □ 



W|j||, hence ||u — w|| < 



5.2. Correctness of GSHP 

Theorem 2. Algorithm 2 exactly solves Problem 2. 

Proof. To motivate the support selection of GSHP, we 
now identify a key relation that holds for any b € R fe : 



5> 2 



A(26 1 -A) + V^f6,-ffi4^) 2 . (10 ) 



By its left-hand side, this relation is invariant under 
permutation of b. Moreover, the summands in the sum 
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over k are certainly non- negative for k > 2, so without 
loss of generality the solution sparsity of the original 
problem is ||/3*||o = k. For k = 1, F is maximized by 
picking an index i that maximizes Am;, which is what 
the algorithm does. 

For the sake of clarity (and space), we first describe 
the proof of the case k > 2 for A = and then explain 
how it generalizes for A =/= 0. In the sequel, let us use 
the shortcut avg(5) = j^y J2jes w j- 

Let 5 be an optimal solution index set and let I be 
the result computed by the algorithm. For a proof (of 
the case k > 2, A = 0) by contradiction, assume that I 
and S differ. Let e be the first element of I\S in the 
order of insertion into I by the algorithm. Let e' be 
the element of S\Io that lies closest to e. Without loss 
of generality, we may assume that w e ^ w e > , otherwise 
we could have chosen 5\{e'} U {e} rather than S as 
solution in the first place. Let Iq C IDS be the indices 
added to I by the algorithm before e. Assume that Jo 
is nonempty. We will later sec how to ensure this. 

Let a := avg(Jo) and a' := avg(5\{e'}). There are 
three ways in which w e , w e i and a' can be ordered 
relative to each other: 

1. e' lies between e and a', thus \w e > — a'\ < \w e — a'\ 
since w e ^ w e > . 

2. a' lies between e and e'. But then, since there are 
no elements of S between e and e', S\I moves the 
average a' beyond a away from e towards e', so 
\w e i — a'\ < \w e i — a\ and \w e — a\ < \w e — a'\. 
But we know that \w e i — a\ < \w e — a\ since 



argmax ie/Q |Wi 



a | by the choice of the greedy 



algorithm and w e ^ w e >. Thus \w e 



-a < \w K 



3. \w e — a'\ < \w e ' — a |, i.e., e lies between a' and 
e'. But this case is impossible: compared to a, a' 
averages over additional values that are closer to a 
than e, and e' is one of them. So a' must be on the 
same side as e' relative to e, not the opposite side. 

So \w e > —a'\< \w e — a' | is assured in all cases. Note in 
particular that if | S\ > 1, \wi — avg(S')| 9 \w 3 — avg(S')|, 
then 

F(S U W) = F(S) + ^ ( Wi - wg(S)" 2 



k 

= F(SU{j}), 



(11) 



where 9 is either '=' or '<'. By inequality (11), F(S) < 
F((S\{e'}) U {e}). But this means that S is not a 
solution: contradiction. 



We have assumed that I is nonempty; this is ensured 
because any solution S must contain at least an index 
i € argmax^ Wj. Otherwise, we could replace a max- 
imal index w.r.t. w in S by this i and get, by (11), 
a larger F value. This would be a contradiction with 
our assumption that S is a solution. Note that this 
maximal index is also picked (first) by the algorithm. 
This completes the proof for the case A = 0. Let us 
now consider the general case where A is unrestricted. 

We reduce the general problem to the case that A = 0. 
Let us write F w \ to make the parameters w and A 
explicit when talking of F. Let w[ t :— — A for one 
i* for which Xwi- is maximal, and let w[ := Wi for all 
other i. We use the fact that, by the definition of F, 

F WjX (S) = 2Xw' i , +\ 2 + F w ,, Q (S) 

when S contains such an element i* € argmax^ (Xwj ■) . 
Clearly, i* is an extremal element w.r.t. w and u>i* has 
maximum distance from —A, so 



€ argmax 



Wj 



E 



A 



J 



By (10), i* must be in the optimal solution for F Wj \. 
Also, F w ifi(S) and 2Xw' i ,+X 2 +F w >_ (S) are maximized 
by the same index sets S when i* e S is required. 
Thus, 

argmax F w \(S) = argmax F w * (S). 
S s-.jes 

Now observe that our previous proof for the case A = 
also works if one adds a constraint that one or more 
indices be part of the solution: If the algorithm com- 
putes these elements as part of its result /, they are 
in Iq = I n S. But this is what the algorithm does on 
input (u>, A); it chooses i* in its first step and then pro- 
ceeds as if maximizing F w \q. Thus we have established 
the algorithm's correctness. □ 

6. Application: Quantum tomography 

Problem: In quantum tomography (QT), we aim to 
learn a density matrix X* € C dxd , which is Hermitian 
(i.e., (X*) H = X*), positive semi-definite (i.e., X* y 
0) and has rank(X*) = r and tr(X*) = 1. The QT 
measurements are y = ,4.(X*) + 77, where (^4(X*)). ; = 
tr(E^X*) + r/i, and rji is zero-mean Gaussian. The 
operators 's are the tensor product of the 2x2 Pauli 
matrices (Liu, 2011). 

Recently, (Liu, 2011) showed that almost all 
such tensor constructions of 0(rd log 6 d) Pauli 
measurements satisfy the so-called rank-r re- 
stricted isometry property (RIP) for all X e 
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{X G C dxd : X h 0,rank(X) < r, ||X||, < ^||X|| F }: 
(1 - S r ) ||X|| 2 F < \\A(X)f F < (1 + S r ) ||X|| 2 F , (12) 

where || • ||* is the nuclear norm (i.e., the sum ol singu- 
lar values), which reduces to tr(X) since X y 0. This 
key observation enables us to leverage the recent the- 
oretical and algorithmic advances in low-rank matrix 
recovery from a few affine measurements. 

The standard matrix-completion based approach to re- 
cover X* from y is the following convex relaxation: 

minhnize||.4.(X)-y|| F + A||X|| st . (13) 

This convex approach is both tractable and amenable 
to theoretical analysis (Gross ct al., 2010; Liu, 2011). 
As a result, we can provably reduce the number of 
samples m from 0{d 2 ) to 6{rd) (Liu, 2011). 

Unfortunately, this convex approach fails to account 
for the physical constraint ||X||* = 1. To overcome 
this difficulty, the relaxation parameter A is tuned to 
obtain solutions with the desired rank followed by nor- 
malization to heuristically meet the trace constraint. 

In this section, we demonstrate that one can do signif- 
icantly better via the non-convex algorithm based on 
(3). A key ingredient then is the following projection: 

B G argmin ||B - W||| s.t. rank(B) = r, tr(B) = 1, 

B^O 

(14) 

for a given Hermitian matrix W G M. nxn . Since the 
RIP assumption holds here, we can obtain rigorous 
guarantees based on a similar analysis to (Foucart, 
2010; Garg & Khandekar, 2009; Meka et al., 2010). 

To obtain the solution, we compute the eigenvalue de- 
composition W = UAwU H and then use the unitary 
invariance of the problem to solve D* G argmin D ||D — 
Aw||f subject to ||D||* < 1 and rank(D) < r, and 
from D* form UD*!^ to obtain a solution. In fact, 
we can constrain D to be diagonal, and thus reduce the 
matrix problem to the vector version for D = diag(d), 
where the projector in Problem 1 applies. This reduc- 
tion follows from the well-known result: 

Proposition 6.1 ((Mirsky, I960)). Let A, B G R mx ™ 

and q = min{rn,n}. Let <7j(A) be the singular values 
of A in descending order (similarly for H). Then, 

^(a l (A)- ( r 4 (B)) 2 <||A-B|| F . 

i=l 

Equation (14) has a solution if r > 1 since the con- 
straint set is non-empty and compact (Weierstrass's 
theorem). As the vector reduction achieves the lower 
bound, it is an optimal projection. 
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Figure 1. Quantum tomography with 8 qubits and 30 dB 
SNR: Each point is the median over 10 random realizations. 
Convex approach 1 refers to (13) and approach 2 is (15). 
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Figure 2. Same as Figure 1 but with 7 qubits, no noise. 

Numerical experiments: We numerically demon- 
strate that the ability to project onto trace and rank 
constraints jointly can radically improve recovery even 
with the simple gradient descent algorithm as in (3). 
We follow the same approach in (Gross et al., 2010): 
we generate random Pauli measurements and add 
white noise. The experiments that follow use a 8 qubit 
system (d = 2 s ) with noise SNR at 30 dB (so the ab- 
solute noise level changes depending on the number of 
measurements), and a 7 qubit noiseless system. 

The measurements are generated using a random real- 
valued matrix X* with rank 2, although the algorithms 
also work with complex- valued matrices. A d x d rank 
r real- valued matrix has dr — r(r — l)/2 ~ dr degrees 
of freedom, hence we need at least 2dr number of mea- 
surements to recover X* from noiseless measurements 
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(due to the null-space of the linear map). To test the 
various approaches, we vary the number of measure- 
ments between 2dr and 5dr. We assume r is known, 
though other computational experience suggests that 
estimates of r return good answers as well. 

The convex problem (13) depends on a parameter A. 
We solve the problem for different A in a bracketing 
search until we find the first A that provides a solution 
with numerical rank r. Like (Flammia et al., 2012), we 
normalize the final estimate to ensure the trace is 1. 
Additionally, we test the following convex approach: 



minimize 

XX0,||X|L<1 



|A(X) 



ylll- 



(15) 



Compared to (13), no parameters are needed since we 
exploit prior knowledge of the trace, but there is no 
guarantee on the rank. Both convex approaches can 
be solved with proximal gradient descent; we use the 
TFOCS package (Becker et al., 2011) since it uses a 
sophisticated line search and Nesterov acceleration. 

To illustrate the power of the combinatorial projec- 
tions, we solve the following non-convex formulation: 



minimize 
X>0,||X||,<l,rank(X)= 



||.4(X)-y|||. 



(16) 



Within the projected gradient algorithm (3), we use 
the GSSP algorithm as described above. The stepsize 
is /i l = 3/ 1 1 A\ | 2 where || • || is the operator norm; we 
can also apply Nesterov acceleration to speed conver- 
gence, but we use (3) for simplicity. Due to the non- 
convexity, the algorithm depends on the starting value 
X . We try two strategies: (i) random initialization, 
and (ii) initializing with the solution from (15). Both 
initializations often lead to the same stationary point. 

Figure 1 shows the relative error ||X — X*||f /||X*||jr 
of the different approaches. All approaches do poorly 
when there are only 2dr measurements since this is 
near the noiseless information-theoretic limit. For 
higher numbers of measurements, the non-convex ap- 
proach substantially outperforms both convex ap- 
proaches. For 2Adr measurements, it helps to start 
Xo with the convex solution, but otherwise the two 
non-convex approaches are nearly identical. 

Between the two convex solutions, (13) outperforms 
(15) since it tunes A to achieve a rank r solution. 
Neither convex approach is competitive with the non- 
convex approaches since they do not take advantage of 
the prior knowledge on trace and rank. 

Figure 2 shows more results on a 7 qubit problem with- 
out noise. Again, the non-convex approach gives bet- 
ter results, particularly when there are fewer measure- 
ments. As expected, both approaches approach perfect 
recovery as the number of measurements increases. 



Approach 


mean time 


standard deviation 


convex 


0.294 s. 


0.030 s. 


non-convex 


0.192 s. 


0.019 s. 



Table 1. Time per iteration of convex and non-convex ap- 
proaches for quantum state tomography with 8 qubits. 



Here we highlight another key benefit of the non- 
convex approach: since the number of eigenvectors 
needed in the partial eigenvalue decomposition is at 
most r, it is quite scalable. In general, the convex ap- 
proach has intermediate iterates which require eigen- 
value decompositions close to the full dimension, espe- 
cially during the first few iterations. Table 1 shows av- 
erage time per iteration for the convex and non-convex 
approach (overall time is more complicated, since the 
number of iterations depends on linesearch and other 
factors). Even using Matlab's dense eigenvalue solver 
eig, the iterations of the non-convex approach are 
faster; problems that used an iterative Krylov sub- 
space solver would show an even larger discrepancy. 2 

7. Application: Measure learning 

Problem: We study the kernel density learning set- 
ting: Let xW, xW, . . . , x(") G R p be an n-size cor- 
pus of p-dimensional samples, drawn from an unknown 
probability density function (pdf) /i(x). Here, we will 
form an estimator /i(x) := Y^,7=i A' t cr(x,xW), where 
K(t(x, y) is a Gaussian kernel with parameter a. Let us 
choose /t(x) to minimize the integrated squared error 
criterion: ISE = E||/t(x) — H2- As a result, we can 
introduce a density learning problem as estimating a 
weight vector (3* G A+. The objective can then be 
written as follows (Bunea et al., 2010; Kim, 1995) 



P* 



G argmm • 
/3eA+ 



(17) 



{/3 T S/3 - c T /3} , 

where £ G R nxn with = k^(xW, x^), and 

Ci^VvfxWxW), (18) 
n — 1 A — ' 



While the combination of the — c T (3 term and the non- 
negativity constraint induces some sparsity, it may not 
be enough. To avoid overfitting or obtain interpretable 
results, one might control the level of solution sparsity 
(Bunca et al., 2010). In this context, we extend (17) 
to include cardinality constraints, i.e. j3* G n 



2 Quantum state tomography does not easily benefit 
from iterative eigenvalue solvers, since the range of .A* is 
not sparse. 
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Figure 3. Density estimation results using the Parzen method (left), the quadratic program (17) (left and middle-top), 
and our approach (middle-bottom and right). 
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Figure 4. Estimation results for different k: Red spikes depict the estimated kernel means as well as the their relative 
contribution to the Gaussian mixture. As k increases, the additional nonzero coefficients in f3* tend to have small weights. 



Numerical experiments: We consider the follow- 
ing Gaussian mixture: fi(x) — | Si=i K f; (Pi> x )> 
where c?i = (7/9)* and /3 i = 14((7j — 1). A sample of 
1000 points is drawn from fi(x). We compare the den- 
sity estimation performance of: (i) the Parzen method 
(Parzen, 1962), (ii) the quadratic programming for- 
mulation in (17), and (Hi) our cardinality-constrained 
version of (17) using GSSP. While fx(x) is constructed 
by kernels with various widths, we assume a constant 
width during the kernel estimation. In practice, the 
width is not known a priori but can be found using 
cross-validation techniques; for simplicity, we assume 
kernels with width <j = 1. 

Figure 3(left) depicts the true pdf and the estimated 
densities using the Parzen method and the quadratic 
programming approach. Moreover, the figure also in- 
cludes a scaled plot of l/<7£, indicating the height of 
the individual Gaussian mixtures. By default, the 
Parzen window method estimation interpolates 1000 
Gaussian kernels with centers around the sampled 
points to compute the estimate fi(x); unfortunately, 
neither the quadratic programming approach (as Fig- 
ure 3 (middle-top) illustrates) nor the Parzen window 
estimator results are easily interpretable even though 
both approaches provide a good approximation of the 
true pdf. 

Using our cardinality-constrained approach, we can 



significantly enhance the interpretability. This is be- 
cause in the sparsity-constrained approach, we can 
control the number of estimated Gaussian compo- 
nents. Hence, if the model order is known a priori, 
the non-convex approach can be extremely useful. 

To see this, we first show the coefficient profile of the 
sparsity based approach for k = 5 in Figure 3 (middle- 
bottom). Figure 3 (right) shows the estimated pdf 
for k = 5 along with the positions of weight coef- 
ficients obtained by our sparsity enforcing approach. 
Note that most of the weights obtained concentrate 
around the true means, fully exploiting our prior in- 
formation about the ingredients of (i(x) — this happens 
with rather high frequency in the experiments we con- 
ducted. Figure 4 illustrates further estimated pdf's 
using our approach for various k. Surprisingly, the re- 
sulting solutions are still approximately 5-sparse even 
if k > 5, as the over-estimated coefficients are ex- 
tremely small, and hence the sparse estimator is rea- 
sonably robust to inaccurate estimates of k. 

8. Application: Portfolio optimization 

Problem: Given a sample covariance matrix S and 
expected mean \i, the return-adjusted Markowitz 
mean-variance (MV) framework selects a portfolio (3* 

such that (3* € argmin SgA + < /3 T S/3 — Tfi T (3 > , where 
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encodes the normalized capital constraint, and 
t trades off risk and return (Brodie et al., 2009; 
DeMiguel et al., 2009). The solution (3* e is the 
distribution of investments over the p available assets. 

While such solutions construct portfolios from scratch, 
a more realistic scenario is to incrementally adjust an 
existing portfolio as the market changes. Due to costs 
per transaction, we can naturally introduce cardinality 
constraints. In mathematical terms, let /3 € K p be the 
current portfolio selection. Given /3, we seek to adjust 
the current selection (3 = j3 + Sp such that ||<5/3||o < k. 
This leads to the following optimization problem: 

8* p e argmin (fi + S p ) T T.[p + dp) - Tfi T (f3 + dp), 

where A is the level of update, and k controls the trans- 
actions costs. During an update, A = would keep the 
portfolio value constant while A > would increase it. 
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Figure 5. Relative error \\/3 — /3*||2/||/3*||2 comparison as 
a function of m: Approach 1 is the non-convex approach 
(3), and approach 2 is (19). Each point corresponds to the 
median value of 30 Monte-Carlo realizations. 



Numerical experiments: To clearly highlight the 
impact of the non-convex projector, we create a syn- 
thetic portfolio update problem, where we know the 
solution. As in (Brodie et al., 2009), we cast this prob- 
lem as a regression problem and synthetically generate 
y = Xf3* where p = 1000 such that /3* € A\ (A is cho- 
sen randomly), and ||/3*||o = k for k = 100. 

Since in general we do not expect RIP assumptions to 
hold in portfolio optimization, our goal here is to refine 
the sparse solution of a state-of-the-art convex solver 
via (3) in order to accommodate the strict sparsity and 
budget constraints. Hence, we first consider the basis 
pursuit criterion and solve it using SPGL1 (van den 
Berg & Friedlander, 2008): 



minimize ||/3|| i s.t. 



X 



= 



y 



(19) 



The normalization by 1/y/p in the last equality gives 
the constraint matrix a better condition number, since 
otherwise it is too ill-conditioned for a first-order 
solver. 

Almost none of the solutions to (19) return a fc-sparse 
solution. Hence, we initialize (3) with the SPGL1 so- 
lution to meet the constraints. The update step in (3) 
uses the GSHP algorithm. 

Figure 5 shows the resulting relative errors ||/3 — 
/3*||2/||/3*||2- We see that not only does (3) return a 
/c-sparse solution, but that this solution is also closer 
to /3 , particularly when the sample size is small. As 
the sample size increases, the knowledge that j3* is 
/c-sparse makes up a smaller percentage of what we 
know about the signal, so the gap between (19) and 
(3) diminishes. 



9. Conclusions 

While non-convexity in learning algorithms is unde- 
sirable according to conventional wisdom, avoiding it 
might be difficult in many problems. In this setting, we 
show how to efficiently obtain exact sparse projections 
onto the positive simplex and its hyperplane exten- 
sion. We empirically demonstrate that our projectors 
provide substantial accuracy benefits in quantum to- 
mography from fewer measurements and enable us to 
exploit prior non-convex knowledge in density learn- 
ing. Moreover, we also illustrate that we can refine 
the solution of well-established state-of-the-art convex 
sparse recovery algorithms to enforce non-convex con- 
straints in sparse portfolio updates. The quantum to- 
mography example in particular illustrates that the 
non-convex solutions can be extremely useful; here, the 
non-convexity appears milder, since a fixed-rank ma- 
trix still has extra degrees of freedom from the choice 
of its eigenvectors. 
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